first commit

2025-02-20 14:57:11 +08:00
commit 687bda5ead
@@ -0,0 +1,138 @@
+We have a movie data set in JSON, Solr XML, and CSV formats.
+All 3 formats contain the same data.  You can use any one format to index documents to Solr.
+
+The data is fetched from Freebase and the data license is present in the films-LICENSE.txt file.
+
+This data consists of the following fields:
+ * "id" - unique identifier for the movie
+ * "name" - Name of the movie
+ * "directed_by" - The person(s) who directed the making of the film
+ * "initial_release_date" - The earliest official initial film screening date in any country
+ * "genre" - The genre(s) that the movie belongs to
+
+ Steps:
+   * Start Solr:
+       bin/solr start
+
+   * Create a "films" core:
+       bin/solr create -c films
+
+   * Set the schema on a couple of fields that Solr would otherwise guess differently (than we'd like) about:
+curl http://localhost:8983/solr/films/schema -X POST -H 'Content-type:application/json' --data-binary '{
+    "add-field" : {
+        "name":"name",
+        "type":"text_general",
+        "multiValued":false,
+        "stored":true
+    },
+    "add-field" : {
+        "name":"initial_release_date",
+        "type":"pdate",
+        "stored":true
+    }
+}'
+
+   * Now let's index the data, using one of these three commands:
+
+     - JSON: bin/post -c films example/films/films.json
+     - XML: bin/post -c films example/films/films.xml
+     - CSV: bin/post \
+                  -c films \
+                  example/films/films.csv \
+                  -params "f.genre.split=true&f.directed_by.split=true&f.genre.separator=|&f.directed_by.separator=|"
+
+   * Let's get searching!
+     - Search for 'Batman':
+       http://localhost:8983/solr/films/query?q=name:batman
+
+       * If you get an error about the name field not existing, you haven't yet indexed the data
+       * If you don't get an error, but zero results, chances are that the _name_ field schema type override wasn't set
+         before indexing the data the first time (it ended up as a "string" type, requiring exact matching by case even).
+         It's easiest to simply reset the environment and try again, ensuring that each step successfully executes.
+
+     - Show me all 'Super hero' movies:
+       http://localhost:8983/solr/films/query?q=*:*&fq=genre:%22Superhero%20movie%22
+
+     - Let's see the distribution of genres across all the movies. See the facet section of the response for the counts:
+       http://localhost:8983/solr/films/query?q=*:*&facet=true&facet.field=genre
+
+     - Browse the indexed films in a traditional browser search interface:
+       http://localhost:8983/solr/films/browse
+
+       Now browse including the genre field as a facet:
+       http://localhost:8983/solr/films/browse?facet.field=genre
+
+       If you want to set a facet for /browse to keep around for every request add the facet.field into the "facets"
+       param set (which the /browse handler is already configured to use):
+curl http://localhost:8983/solr/films/config/params -H 'Content-type:application/json'  -d '{
+"update" : {
+  "facets": {
+    "facet.field":"genre"
+    }
+  }
+}'
+
+        And now http://localhost:8983/solr/films/browse will display the _genre_ facet automatically.
+
+Exploring the data further - 
+
+  * Increase the MAX_ITERATIONS value, put in your freebase API_KEY and run the film_data_generator.py script using Python 3.
+    Now re-index Solr with the new data.
+
+FAQ:
+  Why override the schema of the _name_ and _initial_release_date_ fields?
+
+     Without overriding those field types, the _name_ field would have been guessed as a multi-valued string field type
+     and _initial_release_date_ would have been guessed as a multi-valued pdate type.  It makes more sense with this
+     particular data set domain to have the movie name be a single valued general full-text searchable field,
+     and for the release date also to be single valued.
+
+  How do I clear and reset my environment?
+
+      See the script below.
+
+  Is there an easy to copy/paste script to do all of the above?
+
+    Here ya go << END_OF_SCRIPT
+
+bin/solr stop
+rm server/logs/*.log
+rm -Rf server/solr/films/
+bin/solr start
+bin/solr create -c films
+curl http://localhost:8983/solr/films/schema -X POST -H 'Content-type:application/json' --data-binary '{
+    "add-field" : {
+        "name":"name",
+        "type":"text_general",
+        "multiValued":false,
+        "stored":true
+    },
+    "add-field" : {
+        "name":"initial_release_date",
+        "type":"pdate",
+        "stored":true
+    }
+}'
+bin/post -c films example/films/films.json
+curl http://localhost:8983/solr/films/config/params -H 'Content-type:application/json'  -d '{
+"update" : {
+  "facets": {
+    "facet.field":"genre"
+    }
+  }
+}'
+
+# END_OF_SCRIPT
+
+Additional fun -
+
+Add highlighting:
+curl http://localhost:8983/solr/films/config/params -H 'Content-type:application/json'  -d '{
+"set" : {
+  "browse": {
+    "hl":"on",
+    "hl.fl":"name"
+    }
+  }
+}'
+try http://localhost:8983/solr/films/browse?q=batman now, and you'll see "batman" highlighted in the results
@@ -0,0 +1,117 @@
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""
+This will generate a movie data set of 1100 records.
+These are the first 1100 movies which appear when querying the Freebase of type '/film/film'.
+Here is the link to the freebase page - https://www.freebase.com/film/film?schema=
+
+Usage - python3 film_data_generator.py
+"""
+
+import csv
+import copy
+import json
+import codecs
+import datetime
+import urllib.parse
+import urllib.request
+import xml.etree.cElementTree as ET
+from xml.dom import minidom
+
+MAX_ITERATIONS=10  #10 limits it to 1100 docs
+
+# You need an API Key by Google to run this
+API_KEY = '<insert your Google developer API key>'
+service_url = 'https://www.googleapis.com/freebase/v1/mqlread'
+query = [{
+  "id": None,
+  "name": None,
+  "initial_release_date": None,
+  "directed_by": [],
+  "genre": [],
+  "type": "/film/film",
+  "initial_release_date>" : "2000"
+}]
+
+def gen_csv(filmlist):
+  filmlistDup = copy.deepcopy(filmlist)
+  #Convert multi-valued to % delimited string
+  for film in filmlistDup:
+      for key in film:
+        if isinstance(film[key], list):
+          film[key] = '|'.join(film[key])
+  keys = ['name', 'directed_by', 'genre', 'type', 'id', 'initial_release_date']
+  with open('films.csv', 'w', newline='', encoding='utf8') as csvfile:
+    dict_writer = csv.DictWriter(csvfile, keys)
+    dict_writer.writeheader()
+    dict_writer.writerows(filmlistDup)
+
+def gen_json(filmlist):
+  filmlistDup = copy.deepcopy(filmlist)
+  with open('films.json', 'w') as jsonfile:
+    jsonfile.write(json.dumps(filmlist, indent=2))
+
+def gen_xml(filmlist):
+  root = ET.Element("add")
+  for film in filmlist:
+    doc = ET.SubElement(root, "doc")
+    for key in film:
+      if isinstance(film[key], list):
+        for value in film[key]:
+          field = ET.SubElement(doc, "field")
+          field.set("name", key)
+          field.text=value
+      else:
+        field = ET.SubElement(doc, "field")
+        field.set("name", key)
+        field.text=film[key]
+  tree = ET.ElementTree(root)
+  with open('films.xml', 'w') as f:
+    f.write( minidom.parseString(ET.tostring(tree.getroot(),'utf-8')).toprettyxml(indent="  ") )
+
+def do_query(filmlist, cursor=""):
+  params = {
+          'query': json.dumps(query),
+          'key': API_KEY,
+          'cursor': cursor
+  }
+  url = service_url + '?' + urllib.parse.urlencode(params)
+  data = urllib.request.urlopen(url).read().decode('utf-8')
+  response = json.loads(data)
+  for item in response['result']:
+    del item['type'] # It's always /film/film. No point of adding this.
+    try:
+      datetime.datetime.strptime(item['initial_release_date'], "%Y-%m-%d")
+    except ValueError:
+      #Date time not formatted properly. Keeping it simple by removing the date field from that doc
+      del item['initial_release_date']
+    filmlist.append(item)
+  return response.get("cursor")
+
+
+if __name__ == "__main__":
+  filmlist = []
+  cursor = do_query(filmlist)
+  i=0
+  while(cursor):
+      cursor = do_query(filmlist, cursor)
+      i = i+1
+      if i==MAX_ITERATIONS:
+          break
+
+  gen_json(filmlist)
+  gen_csv(filmlist)
+  gen_xml(filmlist)
@@ -0,0 +1,3 @@
+The films data (films.json/.xml/.csv) is licensed under the Creative Commons Attribution 2.5 Generic License.
+To view a copy of this license, visit http://creativecommons.org/licenses/by/2.5/
+or send a letter to Creative Commons, 444 Castro Street, Suite 900, Mountain View, California, 94041, USA.