[Neo4j Basics] Importing CSV Files


H 에이치 2023. 2. 21. 20:26

Importing CSV files

Data import methods

LOAD CSV
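- Cypher's built-in LOAD CSV clause imports CSV files
APOC library - JSON, XML
- The APOC library is used to load JSON and XML files
- CSV files can also be loaded with APOC
- APOC procedures are still called from Cypher; to load CSV without writing any Cypher, use the Neo4j Data Importer (section (1))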
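
As a rough illustration of the APOC route, the sketch below previews a JSON document with `apoc.load.json`. It assumes the APOC library is installed on the DBMS, and the URL is a hypothetical placeholder, not a file from the course.

```cypher
// Hypothetical URL: replace it with a JSON file you can actually reach.
CALL apoc.load.json('https://example.com/movies.json')
YIELD value
// Each top-level JSON object arrives as a map in `value`.
RETURN value
LIMIT 5
```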

Data types supported by Neo4j

The following data types can be used in the graph
- String
- Long (integer values)
- Double (decimal values)
- Boolean
- Date/Datetime
- Point (spatial)
- StringArray (comma-separated list of strings)
- LongArray (comma-separated list of integer values)
- DoubleArray (comma-separated list of decimal values)
- List types
  - If a field has to be stored as a list, import it as a String (the default) and post-process it afterwards

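Importing CSV files

Preparation

A header row that defines the field names ✨
Or a delimiter that separates the fields within each row
Things to check
- terminator
  - comma (,) by default
  - if the file uses a different character, set FIELDTERMINATOR in the LOAD CSV clause
- delimiters, quotes, and special characters
Each node must have a unique key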
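
A minimal sketch of overriding the field terminator mentioned above, assuming a semicolon-separated file. The file name `ratings-semicolon.csv` is hypothetical and would have to exist in the DBMS import directory.

```cypher
// Hypothetical file in the import directory, with fields separated by ';'
LOAD CSV WITH HEADERS
FROM 'file:///ratings-semicolon.csv' AS row
FIELDTERMINATOR ';'
// Preview a few rows to confirm the columns were split as expected
RETURN row
LIMIT 5
```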

Cautions

Field
- read as string types by default
- check the delimiter used for multi-value fields
- values must not have trailing whitespace

LOAD CSV WITH HEADERS 
FROM 'https://data.neo4j.com/importing/ratings.csv' 
AS row 
RETURN count(row)
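
Besides counting rows, it can help to look at a few of them before importing. The sketch below reuses the same file and only previews it; every value in the returned maps is a string, because LOAD CSV does no type conversion.

```cypher
LOAD CSV WITH HEADERS
FROM 'https://data.neo4j.com/importing/ratings.csv'
AS row
// Each row is a map of header name to string value
RETURN row
LIMIT 5
```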

Troubleshooting

CSV file preparation
- must be saved on the local system
- must have a header row
- data must be clean
  - quotes
  - empty string
  - UTF-8 prefixes used (for example \uc)
  - trailing spaces
  - binary zeros
  - obvious typos
- Unique Key (ID)
- the DBMS must be up and ready
- denormalized data requires a multi-pass import

(1) Importing CSV with the Neo4j Data Importer

The Neo4j Data Importer is a graph app
Useful for loading small to medium data sets of up to 1 million rows
- memory usage is high
- large data sets should be loaded with Cypher
Imports CSV files from your local system into the graph
- examine the CSV file headers
- map them to nodes and relationships in a Neo4j graph
A running Neo4j DBMS can be used as the target of the import
Data can be loaded without knowing Cypher
List fields have to be loaded as strings and post-processed afterwards

Using the Neo4j Data Importer

Open the Neo4j Data Importer
Open the sandbox site and check the connection details
- Websocket Bolt URL: bolt+s://a2ffc78f7066cee90cdb165026238166.neo4jsandbox.com:7687
- Username: neo4j
- Password: hardship-depots-breeze
- IP Address: 44.204.111.42
- HTTP Port: 7474
- Bolt Port: 7687
- Bolt URL: bolt://44.204.111.42:7687
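Specify the column that will serve as the ID
- every field whose name ends in "id" or "Id" is set to integer, and
- is automatically selected as the unique key field
The data type of each field can be specified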

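(2) Importing CSV with Cypher

Importing with Cypher lets you control memory usage
By default the import runs as a single transaction, so for large CSV files the Cypher has to be written so that it runs as multiple transactions
Advantages
- the work can be split into multiple transactions
- data can be refactored while it is imported, which reduces post-processing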
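
One way to split the import into several transactions is `CALL { ... } IN TRANSACTIONS`. The sketch below is an illustration, not the course's exact query: it assumes Neo4j 4.4 or later, the batch size of 500 rows is arbitrary, and only two properties are set to keep it short.

```cypher
// In Neo4j Browser, prefix the query with :auto so it runs as an implicit transaction.
LOAD CSV WITH HEADERS
FROM 'https://data.neo4j.com/importing/2-movieData.csv' AS row
// Keep only the Movie rows before handing them to the subquery
WITH row WHERE row.Entity = "Movie"
CALL {
  WITH row
  MERGE (m:Movie {movieId: toInteger(row.movieId)})
  ON CREATE SET m.title = row.title
} IN TRANSACTIONS OF 500 ROWS  // commit after every 500 input rows
```

Smaller batches lower the memory held by each transaction at the cost of more commits.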

Multi-pass import processing

Steps

Create nodes
Create labels
Create relationships

Creating nodes `LOAD CSV WITH HEADERS FROM ... AS ...`

CALL {
  LOAD CSV WITH HEADERS
  FROM 'https://data.neo4j.com/importing/2-movieData.csv'
  AS row
  // process only Movie rows
  WITH row WHERE row.Entity = "Movie"
  // MERGE on the unique key so reruns do not create duplicates
  MERGE (m:Movie {movieId: toInteger(row.movieId)})
  ON CREATE SET
    m.tmdbId = toInteger(row.tmdbId),
    m.imdbId = toInteger(row.imdbId),
    m.imdbRating = toFloat(row.imdbRating),
    m.released = datetime(row.released),
    m.title = row.title,
    m.year = toInteger(row.year),
    m.poster = row.poster,
    m.runtime = toInteger(row.runtime),
    // multi-value fields: "|"-separated string to list of strings
    m.countries = split(coalesce(row.countries,""), "|"),
    m.imdbVotes = toInteger(row.imdbVotes),
    m.revenue = toInteger(row.revenue),
    m.plot = row.plot,
    m.url = row.url,
    m.budget = toInteger(row.budget),
    m.languages = split(coalesce(row.languages,""), "|")
  // create one Genre node per genre and link the movie to it
  WITH m, split(coalesce(row.genres,""), "|") AS genres
  UNWIND genres AS genre
  WITH m, genre
  MERGE (g:Genre {name:genre})
  MERGE (m)-[:IN_GENRE]->(g)
}

Handling property types with `CASE`

CALL {
  LOAD CSV WITH HEADERS
  FROM 'https://data.neo4j.com/importing/2-movieData.csv'
  AS row
  // process only Person rows
  WITH row WHERE row.Entity = "Person"
  MERGE (p:Person {tmdbId: toInteger(row.tmdbId)})
  ON CREATE SET
    p.imdbId = toInteger(row.imdbId),
    p.bornIn = row.bornIn,
    p.name = row.name,
    p.bio = row.bio,
    p.poster = row.poster,
    p.url = row.url,
    // an empty string cannot be parsed as a date, so store null instead
    p.born = CASE row.born WHEN "" THEN null ELSE date(row.born) END,
    p.died = CASE row.died WHEN "" THEN null ELSE date(row.died) END
}

Creating relationships

CALL {
  LOAD CSV WITH HEADERS
  FROM 'https://data.neo4j.com/importing/2-movieData.csv'
  AS row
  // process only the rows that join a Person to a Movie as an actor
  WITH row WHERE row.Entity = "Join" AND row.Work = "Acting"
  MATCH (p:Person {tmdbId: toInteger(row.tmdbId)})
  MATCH (m:Movie {movieId: toInteger(row.movieId)})
  MERGE (p)-[r:ACTED_IN]->(m)
  ON CREATE
    SET r.role = row.role
  // runs for every matched person, whether or not the relationship is new
  SET p:Actor
}

Error

Neo.ClientError.Transaction.TransactionTimedOut
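- raised when the transaction times out after only part of the data has been imported
- simply rerun the query; because the statements use MERGE, rows that were already imported are not duplicated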

Modeling the imported data set w/ Cypher

Property

property values are written as
- strings
- Longs (integer values)
- Doubles (decimal values)
- Datetimes
- Booleans
Transforming data types from string to multi-value list of strings
Adding the Actor and Director labels to the Person nodes
Adding more constraints per the graph data model
Creating the Genre nodes from the data in the Movie nodes

Inspecting data types

CALL apoc.meta.nodeTypeProperties()
YIELD nodeType, propertyName, propertyTypes

#Question❓ How can only some of the node properties be inspected?
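
One possible answer to the question above is to filter the procedure output with WHERE after YIELD. This is a sketch that assumes the Person label and the born/died properties used elsewhere in this post; `CONTAINS` is used so the filter does not depend on the exact formatting of `nodeType`.

```cypher
CALL apoc.meta.nodeTypeProperties()
YIELD nodeType, propertyName, propertyTypes
// Keep only the rows for Person nodes and the two date properties
WHERE nodeType CONTAINS 'Person'
  AND propertyName IN ['born', 'died']
RETURN nodeType, propertyName, propertyTypes
```

Note that `CONTAINS 'Person'` would also match any other label whose name includes "Person".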

Inspecting relationship types

CALL apoc.meta.relTypeProperties()
YIELD relType, propertyName, propertyTypes

Converting the born and died properties of Person nodes to date `date(n.property)`

SET ... = CASE... WHEN ... THEN ... ELSE .... END

MATCH (p:Person)
SET p.born = CASE p.born WHEN "" THEN null ELSE date(p.born) END

WITH p
SET p.died = CASE p.died WHEN "" THEN null ELSE date(p.died) END

Converting multi-value properties to lists

All values in a list must be of the same type
coalesce(property, "") returns its first non-null argument, so it yields an empty string when the property (for example m.countries) is null.
split(property, "|") breaks the multi-value field at each "|" separator and creates a list of the elements.
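
The behaviour of the two functions can be checked on literal values without touching the graph. The sample strings below are made up for illustration.

```cypher
RETURN
  // null is replaced by "", and splitting "" gives a list holding one empty string
  split(coalesce(null, ""), "|") AS fromNull,
  // a "|"-separated string becomes a list with one element per value
  split("USA|UK", "|") AS fromValue
```

Because a null field turns into a list that contains a single empty string, it may be worth filtering out empty strings before creating nodes from such lists.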

MATCH (m:Movie)
SET m.countries = split(coalesce(m.countries,""), "|")
SET m.languages = split(coalesce(m.languages,""), "|")
SET m.genres = split(coalesce(m.genres, ""), "|")
RETURN m

String ➡️ StringArray

Label

Adding a label to MATCHed nodes

Add the Actor label to Person nodes that have an [:ACTED_IN] relationship

MATCH (p:Person)-[a:ACTED_IN]->(m:Movie)
WITH DISTINCT p SET p:Actor

Why use WITH DISTINCT

Counting the matched rows gives
- count(p): 372
- count(a): 372
- count(m): 372
With DISTINCT (WITH DISTINCT ... RETURN count ...)
- count(p): 353
- count(a): 372
- count(m): 93
  ➡️ Person nodes that acted in several movies are returned more than once
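
The comparison above can be reproduced in a single query. This is a sketch: the result aliases are made up, and the numbers depend on the data that was loaded.

```cypher
MATCH (p:Person)-[a:ACTED_IN]->(m:Movie)
RETURN
  count(p) AS personRows,               // one per matched path, so people repeat
  count(DISTINCT p) AS distinctPersons, // each Person counted once
  count(DISTINCT m) AS distinctMovies   // each Movie counted once
```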

Creating nodes from a property

Adding a uniqueness constraint

A best practice is to always have a unique ID for every type of node in the graph
Having a uniqueness constraint defined helps with performance when creating nodes and also for queries.
The MERGE clause looks up nodes using the property value defined for the constraint. With a constraint, it is a quick lookup and if the node already exists, it is not created.

Creating the constraint

CREATE CONSTRAINT Genre_name IF NOT EXISTS
FOR (x:Genre)
REQUIRE x.name IS UNIQUE

Checking the result

SHOW CONSTRAINTS

id	name	type	entityType	labelsOrTypes	properties	ownedIndexId
10	"Genre_name"	"UNIQUENESS"	"NODE"	["Genre"]	["name"]	9

property ➡️ creating nodes and relationships

MATCH (m:Movie)
UNWIND m.genres AS genre
WITH m, genre
MERGE (g:Genre {name:genre})
MERGE (m)-[:IN_GENRE]->(g)

Removing the property

MATCH (m:Movie)
SET m.genres = null

Checking the schema

CALL db.schema.visualization

Deletion

Deleting nodes (`DETACH DELETE`)

MATCH (u:User) DETACH DELETE u;
MATCH (p:Person) DETACH DELETE p;
MATCH (m:Movie) DETACH DELETE m;
MATCH (n) DETACH DELETE n

Removing a label (`REMOVE`)

MATCH (d:Director)
REMOVE d:Director

Removing a property (`SET ... = null`)

MATCH (m:Movie) 
SET m.genres = null
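
As an alternative to setting the property to null, the REMOVE clause also works on properties, not only on labels. A minimal sketch with the same genres property:

```cypher
MATCH (m:Movie)
// Same effect as SET m.genres = null: the property no longer exists on the node
REMOVE m.genres
```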

Verification

Checking the schema

CALL db.schema.visualization

Checking constraints

SHOW CONSTRAINTS

Source: https://graphacademy.neo4j.com/courses/importing-data/

License: Attribution, NonCommercial, NoDerivatives
